Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Post flight task for firmware upgrades #139

Merged
merged 17 commits into from
Nov 2, 2020

Conversation

krihal
Copy link
Contributor

@krihal krihal commented Oct 18, 2020

In this first step the post-flight task will try to update the OS version of the devices.

The argument post_waittime specifies the time to wait (in seconds) before trying to gather the facts from the device(s), usually it will take around 5 - 6 minutes for an Arista switch to reboot.

@indy-independence
Copy link
Member

In what case would you not want to use the post_flight check? I feel like this should always be done, and probably doesn't even need an option to be able to disable it? Maybe the post_waittime doesn't need to be user specified either, it could be something like try once after 5 minutes, and if it fails re-check again every 5 minutes up to a maximum of 30 min or something

@krihal
Copy link
Contributor Author

krihal commented Oct 19, 2020

In what case would you not want to use the post_flight check? I feel like this should always be done, and probably doesn't even need an option to be able to disable it? Maybe the post_waittime doesn't need to be user specified either, it could be something like try once after 5 minutes, and if it fails re-check again every 5 minutes up to a maximum of 30 min or something

In the case where we do the upgrade in several steps. For example one might want to download the firmware image to all switches during daytime and do the upgrade itself later that night. In that case you want to run pre-flight and download in the first step and later that night run activate, reboot and post-flight.

Regarding wait-time, for Arista we can be somewhat sure that a default value of lets say 10 minutes would be good enough. But if we want to support other vendors in the future we definitely want a configurable option here. A good compromise for now might be to give it a default value of 10 minutes.

Sure, I can implement some logic for re-trying - or the user just run this with only the post-flight option set if it fails the first time.

@indy-independence
Copy link
Member

Yes ok that makes sense, but in that case maybe you could tie post_flight so it always runs when firmware is activated for example? (and pre_flight so it always runs before download?)
There's so many options the user has to consider now: download, activate, reboot, pre_flight and post_flight should be set to true or false. That's 2^5=32 different combination of options, and I'm not even sure all those combinations would make sense to use (and how to test all those combinations?)

@indy-independence
Copy link
Member

I think we can hide away some of the complexity in a WebUI for example, but there's still a few things that I'm considering:

  1. The default waittime of 0 seems odd to me, is there a reason not to set it to something like 600 sec?
  2. The post_flight check does not actually check if the selected firmware was activated?

@krihal
Copy link
Contributor Author

krihal commented Oct 20, 2020

  1. As I wrote in a previous comment: "A good compromise for now might be to give it a default value of 10 minutes." I'll push a fix with that set as a default value.
  2. The activate step will check that the new firmware was activated properly. If the switch says it was able to activate it I don't see a reason for checking it again. The reason for not checking it again is that it will require the user to pass the filename argument (since that is what we're using to verify that the correct firmware is active) if even if we just want to update the OS version on all switches. I see a use-case where some switches might have been upgraded manually and one just want to update the database with the new versions. But I can introduce the parameter post_flight_dont_check_activated_firmware_version to work around that.

@indy-independence
Copy link
Member

After some more discussion on the side about checking if the device booted with new firmware I propose:

  1. During the activate step, before updating the boot variable check if it's already set to the target firmware. If it is then raise something like FirmwareAlreadyActiveException
  2. In "def device_upgrade_task() if activate:", check for the specific exception above. If it's raised then skip reboot and post_flight steps
  3. In post_flight, check that prev_os_version and new os_version are not the same, and raise exception if they are

@krihal
Copy link
Contributor Author

krihal commented Oct 20, 2020

Ok...

@krihal krihal closed this Oct 24, 2020
@krihal krihal deleted the feature.post_flight_check branch October 24, 2020 11:07
@indy-independence indy-independence restored the feature.post_flight_check branch October 28, 2020 12:15
@indy-independence
Copy link
Member

Added check to see if OS version changed in post_flight
Improved logging for webUI and added total-count header for webUI progress bar

@indy-independence indy-independence merged commit 8394167 into develop Nov 2, 2020
This was referenced Feb 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants